Abstract: Continued reliance on human operators for managing
data centers is a major impediment to their
ever reaching extreme dimensions. Large computer systems
in general, and data centers in particular, will ultimately be 
managed using predictive computational and executable models 
obtained through data-science tools, and at that point, the 
intervention of humans will be limited to setting high-level 
goals and policies rather than performing low-level operations. 
Data-driven autonomics, where management and control are
based on holistic predictive models that are built and updated
using generated data, open one possible path towards limiting
the role of operators in data centers. In this paper, we present
a data-science study of a public Google dataset collected in a 
12K-node cluster with the goal of building and evaluating a 
predictive model for node failures. We use BigQuery, the big 
data SQL platform from the Google Cloud suite, to process 
massive amounts of data and generate a rich feature set 
characterizing machine state over time. We describe how an 
ensemble classifier can be built out of many Random Forest 
classifiers each trained on these features, to predict if machines 
will fail in a future 24-hour window. Our evaluation reveals 
that if we limit false positive rates to 5%, we can achieve true 
positive rates between 27% and 88% with precision varying 
between 50% and 72%. We discuss the practicality of including 
our predictive model as the central component of a data-driven 
autonomic manager and operating it on-line with live data 
streams (rather than off-line on data logs). All of the scripts 
used for BigQuery and classification analyses are publicly 
available from the authors’ website. 
I. Introduction

Modern data centers are the engines of the Internet that
run e-commerce sites and cloud-based services accessed from
mobile devices, and that power the social networks used each
day by hundreds of millions of users. Given the pervasiveness
of these services in many aspects of our daily lives,
continued availability of data centers is critical. And when
continued availability is not possible, service degradations 
and outages need to be foreseen in a timely manner so 
as to minimize their impact on users. For the most part, 
current automated data center management tools are limited 
to low-level infrastructure provisioning, resource allocation, 
scheduling or monitoring tasks with no predictive capabilities.
This leaves the brunt of the problem of detecting and
resolving undesired behaviors to armies of operators who
continuously monitor streams of data being displayed on
monitors. Even at the highly optimistic rate of 26,000 servers
managed per staffer,¹ this situation is not sustainable if data
centers are ever to reach exascale dimensions. Applying
traditional autonomic computing techniques to large data
centers is problematic since their complex system characteristics
prohibit building a “cause-effect” system model that is
essential for closing the control loop. Furthermore, current
autonomic computing technologies are reactive and try to
steer the system back to desired states only after undesirable
states are actually entered; they lack predictive capabilities
to anticipate undesirable states in advance so that proactive
actions can be taken to avoid them in the first place.

If data centers are the engines of the Internet, then data is 
their fuel and exhaust. Data centers generate and store vast 
amounts of data in the form of logs corresponding to various 
events and errors in the course of their operation. When 
these computing infrastructure logs are augmented with 
numerous other internal and external data channels including 
power supply, cooling, management actions such as software 
updates, server additions/removal, configuration parameter 
changes, network topology modifications, or operator actions 
to modify electrical wiring or change the physical locations 
of racks/server/storage devices, data centers become ripe 
to benefit from data science. The grand challenge is to 
exploit the toolset of modern data science and develop a new 
generation of autonomics that is data-driven, predictive and
proactive based on holistic models that capture a data center
as an ecosystem including not only the computer system 
as such, but also its physical as well as its socio-political 
environment. 

In this paper we present the results of an initial study 
towards building predictive models for node failures in data 
centers. The study is based on a recent Google dataset 
containing workload and scheduler events emitted by the 
Borg cluster management system in a cluster of over
12,000 nodes during a one-month period. We employed
BigQuery [4], a big data tool from the Google Cloud
Platform that allows running SQL-like queries on massive
data, to perform an exploratory feature analysis. This step
generated a large number of features at various levels of
aggregation suitable for use in a machine learning classifier.
The use of BigQuery has allowed us to complete the analysis
for large amounts of data (table sizes up to 12TB containing
over 100 billion rows) in reasonable amounts of time.

¹Delfina Eberly, Director of Data Center Operations at Facebook, speaking
on “Operations at Scale” at the 7x24 Exchange 2013 Fall Conference.

For the classification study, we employed an ensemble 
that combines the output of multiple Random Forests (RF) 
classifiers, which themselves are ensembles of Decision 
Trees. RF were employed due to their proven suitability in
situations where the number of features is large and the
classes are “unbalanced”, such that one of the classes
consists mainly of “rare events” that occur with very low 
frequency. Although individual RF were better than other 
classifiers that were considered in our initial tests, they still 
exhibited limited performance, which prompted us to pursue 
an ensemble approach. While individual trees in RF are 
based on subsets of features, we used a combination of 
bagging and data subsampling to build the RF ensemble 
and tailor the methodology to this particular dataset. Our 
ensemble classifier was tested on several days from the trace 
data, resulting in very good performance on some days (up 
to 88% true positive rate, TPR, and 5% false positive rate, 
FPR), and modest performance on other days (minimum of 
27% TPR at the same 5% FPR). Precision levels in all cases 
remained between 50% and 72%. We should note that these 
results are comparable to other failure prediction studies in 
the field. 

The contributions of our work are severalfold. First, we 
advocate that modern data centers can be scaled to extreme 
dimensions only by eliminating reliance on human operators
through the adoption of a new generation of autonomics that is
data-driven and based on holistic predictive models. Towards this
goal, we provide a failure prediction analysis for a dataset 
that has been studied extensively in the literature from other 
perspectives. The model we develop has very promising 
predictive power and has the potential to form the basis of a 
data-driven autonomic manager for data centers. Secondly, 
we propose an ensemble classification methodology tailored 
to this particular problem where subsampling is combined 
with bagging and precision-weighted voting to maximize 
performance. Thirdly, we provide one of the first examples
of BigQuery usage in the literature with quantitative
evaluation of running times as a function of data size. All
of the scripts used for BigQuery and classification analysis
are publicly available from our website under the GNU
General Public License.


The rest of this paper is organized as follows. The next 
section describes the process of building features from the 
trace data. Section III describes our classification approach
while our prediction results are presented in Section IV.
Related work is discussed in Section V. In Section VI we
discuss the issues surrounding the construction of a data-driven
autonomic controller based on our predictive model
and argue its practicality. Section VII concludes the paper.


II. Building the feature set with BigQuery 

The workload trace published by Google contains several 
tables monitoring the status of the machines, jobs and tasks 
during a period of approximately 29 days for a cluster 
of 12,453 machines. This includes task events (over 100 
million records, 17GB uncompressed), which follow the 
state evolution for each task, and task usage logs (over 
1 billion records, 178GB uncompressed), which report the 
amount of resources per task at approximately 5 minute 
intervals. We have used the data to compute the overall load 
and status of different cluster nodes at 5 minute intervals. 
This resulted in a time series for each machine and feature 
that spans the entire trace (periods when the machine was 
“up”). We then proceeded to obtain several features by 
aggregating measures in the original data. Due to the size of 
the dataset, this aggregation analysis was performed using 
BigQuery on the trace data directly from Google Cloud 
Storage. We used the bq command line tool for the entire
analysis, and our scripts are available online through our
Web site.

From task events, we obtained several time series for each 
machine with a time resolution of 5 minutes. A total of 7 
features were extracted, which count the number of tasks 
currently running, the number of tasks that have started 
in the last 5 minutes and those that have finished with 
different exit statuses — evicted, failed, finished normally, 
killed or lost. From task usage data, we obtained 5 additional 
features (again at 5-minute intervals) measuring the load at 
machine level in terms of: CPU, memory, disk time, cycles 
per instruction (CPI) and memory accesses per instruction 
(MAI). This resulted in a total of 12 basic features that were 
extracted. For each feature, at each time step we consider 
the previous 6 time windows (corresponding to the machine 
status during the last 30 minutes) obtaining 72 features in 
total (12 basic features x 6 time windows). 
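As an illustration of the lagging step described above, the following is a minimal Python sketch, not the BigQuery scripts used in the study. The input layout (a dictionary from feature name to a list of 5-minute samples), the function name `lag_features` and the choice to skip time steps without a full 30-minute history are all assumptions made for the example.

```python
def lag_features(basic, n_lags=6):
    """Expand per-feature time series into rows of lagged values.

    basic maps a feature name to its list of 5-minute samples (hypothetical
    layout); all lists are assumed to have the same length.
    """
    names = sorted(basic)
    length = len(basic[names[0]])
    rows = []
    # Assumption: time steps without a full history are skipped.
    for t in range(n_lags - 1, length):
        row = []
        for name in names:
            # The n_lags most recent samples of this feature, oldest first.
            row.extend(basic[name][t - n_lags + 1 : t + 1])
        rows.append(row)
    return rows


# Twelve dummy basic features with ten samples each.
basic = {f"feature_{i:02d}": list(range(10)) for i in range(12)}
rows = lag_features(basic)
print(len(rows), len(rows[0]))
```

Each row holds 12 x 6 values, matching the 72 features counted in the text.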

The procedure for obtaining the basic features was extremely
fast on the BigQuery platform. For task counts, we
started with constructing a table of running tasks, where 
each row corresponds to one task and includes its start time, 
end time, end status and the machine it was running on. 
Starting from this table, we could obtain the time series for 
each feature for each machine, requiring between 139 and 
939 seconds on BigQuery per feature (one separate table 
per feature was obtained). The features related to machine 
load were computed by summing over all tasks running on 
a machine in each time window, requiring between 3585 
and 9096 seconds on BigQuery per feature. The increased 
execution time is due to the increased table sizes (over 
1 billion rows). We then performed a JOIN of all above 
tables to combine the basic features into a single table with 
104,197,215 rows (occupying 7GB). For this analysis, our 
experience allows us to judge BigQuery as being extremely
fast; an equivalent computation would have taken months to
perform on a regular PC.

Aggregation   Average, SD, CV      Correlation
1h            166 (all features)   45 (6.5)
12h           864 (all features)   258.8 (89.1)
24h           284.6 (86.6)         395.6 (78.9)
48h           593.6 (399.2)        987.2 (590)
72h           726.6 (411.5)        1055.47 (265.23)
96h           739.4 (319.4)        1489.2 (805.9)

Table I: Running times required by BigQuery for obtaining
features aggregated over different time windows, for two
aggregation types: computing averages, standard deviation
(SD) and coefficient of variation (CV) versus computing
correlations. For 1h and 12h windows, average, SD and
CV were computed for all features in a single query. For
all other cases, the mean (and standard deviation) of the
required times per feature are shown.

A second level of aggregation meant looking at features 
over longer time windows rather than just the last 5 minutes. 
At each time step, 3 different statistics — averages, standard 
deviations and coefficients of variation — were computed 
for each basic feature obtained at the previous step. This 
was motivated by the suspicion that not only feature values 
but also their deviation from the mean could be important 
in understanding system behavior. Six different running 
windows of sizes 1, 12, 24, 48, 72 and 96 hours were used to 
capture behavior at various time resolutions. This resulted 
in 216 additional features (3 statistics x 12 features x 6 
window sizes). 
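The three statistics can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the window is given as a number of samples, the population standard deviation is used, and the coefficient of variation is set to zero when the mean is zero. None of these choices is specified in the text, and `window_stats` is a hypothetical helper name.

```python
import statistics

STATISTICS = ("average", "sd", "cv")
N_BASIC_FEATURES = 12
WINDOW_HOURS = (1, 12, 24, 48, 72, 96)


def window_stats(values, t, n_samples):
    """Return (average, SD, CV) over the n_samples values ending at index t."""
    window = values[max(0, t - n_samples + 1) : t + 1]
    mean = statistics.fmean(window)
    # Assumption: population standard deviation.
    sd = statistics.pstdev(window)
    # Assumption: CV is reported as 0 when the mean is 0.
    cv = sd / mean if mean != 0 else 0.0
    return mean, sd, cv


# Number of aggregated features: statistics x basic features x window sizes.
n_aggregated = len(STATISTICS) * N_BASIC_FEATURES * len(WINDOW_HOURS)
example = window_stats([2, 4, 4, 4, 5, 5, 7, 9], t=7, n_samples=8)
print(n_aggregated, example)
```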

In order to generate these aggregated features, a set of
intermediate tables was used. For each time point, these
tables consisted of the entire set of data points to be
averaged. For instance, for 1-hour averages, the table would
contain a set of 6 values for each feature and for each time
point, showing the evolution of the system over the past
hour. While generating these tables was not time consuming
(requiring between 197 and 960 seconds), their sizes were
quite impressive: ranging from 143 GB (over 1 billion rows)
for 1 hour up to 12.5 TB (over 100 billion rows) in the case
of the 96-hour window. Processing these tables to obtain the
aggregated features of interest required significant resources
and would not have been possible without the BigQuery
platform. Even then, direct queries using a single GROUP
BY operation to obtain all 216 features were not possible;
only one basic feature could be handled at a time, with the
results combined into a single table at the end. Table I
shows statistics over the time required to obtain one feature
for the different window sizes.

Although independent feature values are important, another
criterion that could be important for prediction is
the relations that exist between different measures. Correlation
between features is one such measure, with different
correlation values indicating changes in system behavior.
Hence we introduced a third level of aggregation of the
data by computing correlations between a chosen set of
feature pairs, again over various window sizes (1 to 96 
hours as before). We chose 7 features to analyze: number of 
running, started and failed jobs together with CPU, memory, 
disk time and CPI. By computing correlations between all 
possible pairings of the 7 features, we obtained a total of 21 
correlation values for each window size. This introduces 126 
additional features to our dataset. The BigQuery analysis 
started from the same intermediate tables as before and 
computed correlations for one feature pair at a time. As
can be seen in Table I, this step was more time consuming
than the previous aggregation step, yet
still remains manageable considering the size of the data.
The amount of data processed for these queries ranged from 
49.6GB (per feature pair for 1-hour windows) to 4.33TB 
(per feature pair for 96-hour windows), resulting in a higher 
processing cost (5 USD per TB processed). Yet again, a 
similar analysis would not have been possible without the 
BigQuery platform. 
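The pair counting and the correlation itself can be illustrated with a short Python sketch. The text does not name the correlation coefficient, so the Pearson coefficient used here is an assumption, as are the feature labels and the convention of returning zero when one series is constant.

```python
import math
from itertools import combinations

# Hypothetical labels for the 7 features chosen in the text.
FEATURES = ["running", "started", "failed", "cpu", "memory", "disk_time", "cpi"]
WINDOW_HOURS = (1, 12, 24, 48, 72, 96)


def pearson(x, y):
    """Pearson correlation of two equally long sequences (assumed measure)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Assumption: a constant series yields a correlation of 0.
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# All unordered pairs of the 7 features, one correlation per window size.
pairs = list(combinations(FEATURES, 2))
n_correlation_features = len(pairs) * len(WINDOW_HOURS)
print(len(pairs), n_correlation_features)
```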

The Google trace also reports machine events. These are 
scheduler events corresponding to machines being added or 
removed from the pool of resources. Of particular interest 
are REMOVE events, which can be due to two causes: machine
failures or machine software updates. The goal of this
work is to predict REMOVE events due to machine failures,
so the two causes have to be distinguished. Prompted by our
discussions, publishers of the Google trace investigated the
best way to perform this distinction and suggested looking
at the length of time that machines remain down, that is, the
time from the REMOVE event of interest to the next ADD
event for the same machine. If this “down time” is large,
then we can assume that the REMOVE event was due to a 
machine failure, while if it is small, the machine was most 
likely removed to perform a software update. To ensure that 
an event considered to be a failure is indeed a real failure, 
we used a relatively-long “down time” threshold of 2 hours, 
which is greater than the time required for a typical software 
update. Based on this threshold, out of a total of 8,957 
REMOVE events, 2,298 were considered failures, and were 
the target of our predictive study. For the rest of the events, 
for which we cannot be sure of the cause, the data points 
in the preceding 24-hour window were removed completely 
from the dataset. An alternative would have been to consider
them part of the SAFE class; however, this might not be true
for some of the points. Thus, removing them completely
ensures that all data labeled as SAFE are in fact SAFE. 
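The down-time rule can be sketched as follows. This is a hypothetical illustration rather than the authors' script: the event layout (time-ordered `(timestamp, kind)` pairs for one machine, in seconds), the use of "at least 2 hours" as the comparison, and the treatment of a machine that is never re-added as a failure are all assumptions.

```python
FAILURE_DOWNTIME_S = 2 * 3600  # the 2-hour threshold from the text, in seconds


def label_removes(events, threshold=FAILURE_DOWNTIME_S):
    """Label each REMOVE event of one machine as 'failure' or 'uncertain'.

    events is a time-ordered list of (timestamp_s, kind) pairs, where kind
    is 'ADD' or 'REMOVE' (hypothetical layout).
    """
    labels = []
    for i, (ts, kind) in enumerate(events):
        if kind != "REMOVE":
            continue
        # Timestamp of the next ADD for this machine, if there is one.
        next_add = next((t for t, k in events[i + 1 :] if k == "ADD"), None)
        # Assumption: a machine that never returns counts as a long down time.
        is_failure = next_add is None or next_add - ts >= threshold
        labels.append((ts, "failure" if is_failure else "uncertain"))
    return labels


events = [
    (0, "ADD"),
    (1000, "REMOVE"),
    (1600, "ADD"),              # back after 10 minutes
    (5000, "REMOVE"),
    (5000 + 3 * 3600, "ADD"),   # back after 3 hours
]
labels = label_removes(events)
print(labels)
```

Events labeled "uncertain" correspond to those whose preceding 24-hour window is dropped from the dataset.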

To the above features based mostly on load measures, 
we added two new features: the up time for each machine 
(time since the last corresponding ADD event) and number 
of REMOVE events for the entire cluster within the last hour. 
This resulted in a total of 416 features for 104,197,215 data 
points (almost 300GB of processed data). Figure 1 displays
the time series for 4 selected features (and the REMOVE
events) at one typical machine. 
Figure 1: Four time series (4 out of 416 features) for one machine in the system. The features shown are: CPU for the last
time window, CPU averages over 12 hours, CPU coefficient of variation for the last 12 hours and correlation between CPU
and number of running jobs in the last 12 hours. Grey vertical lines indicate times of REMOVE events, some followed by
gaps during which the machine was unavailable. The large gap from ~250 hours to ~370 hours is an example of a long
machine downtime, following a series of multiple failures (cluster of grey vertical lines around 250 hours). In this case, the
machine probably needed more extensive investigation and repair before being inserted back in the scheduler pool.


III. Classification approach

The features obtained in the previous section were used 
for classification with the Random Forest (RF) classifier. 
The data points were separated into two classes: SAFE
(negatives) and FAIL (positives). To do this, for each data
point (corresponding to one machine at a certain time
point) we computed time_to_remove as the time to the next
REMOVE event. Then, all points with time_to_remove less
than 24 hours were assigned to the class FAIL while all
others were assigned to the class SAFE. We extracted all
the FAIL data points corresponding to real failures (108,365
data points) together with a subset of the SAFE class,
corresponding to 0.5% of the total by random subsampling
(544,985 points after subsampling). We used this procedure
to deal with the fact that the SAFE class is much larger than
the FAIL class and classifiers have difficulty learning patterns
from very imbalanced datasets. Subsampling is one way of
reducing the extent of this imbalance. Even after this
subsampling procedure, negatives are about five times the
number of positives. These 653,350 data points (SAFE plus
FAIL) formed the basis of our predictive study.
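The labeling and subsampling steps can be illustrated with the sketch below. It is a simplified example under assumptions: timestamps are in seconds, `remove_times` holds the sorted failure times of one machine, and the removal of points preceding uncertain REMOVE events is not modeled. The names `label_point` and `subsample_safe` are hypothetical.

```python
import bisect
import random

HORIZON_S = 24 * 3600  # 24-hour prediction window, in seconds


def label_point(t, remove_times):
    """Return 'FAIL' if the next REMOVE is less than 24 hours after t."""
    i = bisect.bisect_left(remove_times, t)  # first REMOVE at or after t
    if i < len(remove_times) and remove_times[i] - t < HORIZON_S:
        return "FAIL"
    return "SAFE"


def subsample_safe(labeled, fraction=0.005, seed=0):
    """Keep every FAIL point and a random fraction of the SAFE points."""
    rng = random.Random(seed)
    fail = [p for p in labeled if p[1] == "FAIL"]
    safe = [p for p in labeled if p[1] == "SAFE"]
    keep = rng.sample(safe, round(len(safe) * fraction))
    return fail + keep


labeled = [(i, "SAFE") for i in range(1000)] + [(i, "FAIL") for i in range(10)]
subset = subsample_safe(labeled)
print(len(subset))
```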

Given the large number of features, some might be 
more useful than others, hence we explored two types of 
feature selection mechanisms. One was principal component 
analysis, which uses the original features to build a set of 
principal components — additional features that account 
for most of the variability in the data. Then one can use
only the top principal components for classification, since
those should contain the most important information. We
trained classifiers with an increasing number of principal
components, however the performance obtained was not
better than using the original features. A second mechanism
was to filter the original features based on their correlation
to the time to the next failure event (time_to_remove above).
Correlations were in the interval [-0.3, 0.45], and we used
only those features with absolute correlation larger than a
threshold. We found that the best performance was obtained
with a null threshold, which means again using all features.
Hence, our attempts to reduce the feature set did not produce
better results than the RF trained directly on the original
features. One reason for this may be the fact that the RF itself
performs feature selection when training the Decision Trees.
It appears that the RF mechanism performs better in this
case than correlation-based filtering or principal component
analysis.

To evaluate the performance of our approach, we employed
cross validation. Given the procedure we used to
define the two classes, there are multiple data points
corresponding to the same failure (data over 24 hours with 5
minutes resolution). Since some of these data points are very
similar, choosing the train and test data cannot be done by
selecting random subsets. While random selection may give
extremely good prediction results, it is not realistic since we
would be using test data which is too similar to the training
data. This is why we opted for a time-based separation of
train and test data. We considered basing the training on
data over a 10-day window, followed by testing based on
data over the next day with no overlap with the training
data. Hence, the test day started 24 hours after the last
training data point. The first two days were omitted in order
to decrease the effect on aggregated features. In this manner,
fifteen train/test pairs were obtained and used as benchmarks
to evaluate our analysis (see Figure 2). This forward-in-time
cross validation procedure ensures that classification
performance is realistic and not an artifact of the structure
of the data. Also, it mimics the way failure prediction would
be applied in a live data center, where every day a model
could be trained on past data to predict future failures.

Figure 2: Cross validation approach: forward-in-time testing.
Ten days were used for training and one day for testing. A set
of 15 benchmarks (train/test pairs) were obtained by sliding
the train/test window over the 29-day trace.
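The sliding train/test windows can be sketched in Python as below. The exact day boundaries used in the study are not stated, so the day indexing, the function name `forward_splits` and its parameters are assumptions; in this sketch, assuming 28 complete days are usable, the sliding window yields 15 train/test pairs.

```python
def forward_splits(n_days, train_days=10, gap_days=1, test_days=1, skip_days=2):
    """Return (train, test) day ranges for forward-in-time evaluation.

    Days are 0-based indices (hypothetical convention). The gap keeps the
    test day 24 hours away from the last training data point.
    """
    splits = []
    start = skip_days  # the first days are omitted, as in the text
    while start + train_days + gap_days + test_days <= n_days:
        train = range(start, start + train_days)
        test_start = start + train_days + gap_days
        test = range(test_start, test_start + test_days)
        splits.append((train, test))
        start += 1  # slide the whole window forward by one day
    return splits


# Assumption: 28 complete days of the roughly 29-day trace are usable.
splits = forward_splits(28)
print(len(splits), list(splits[0][0]), list(splits[0][1]))
```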

Given that many points from the FAIL class are very 
similar, which is not the case for the SAFE class due to 
initial subsampling, the information in the SAFE class is still 
overwhelmingly large. This prompted us to further subsample
the negative class in order to obtain the training data.
This was performed in such a way that the ratio between
SAFE and FAIL data points is equal to a parameter fsafe. We
varied this parameter with the values {0.25, 0.5, 1, 2, 3, 4}
while using all of the data points from the positive class so 
as not to miss any useful information. This applied only for 
training data: for testing we always used all data from both 
the negative and positive classes (out of the base dataset of 
653,350 points). We also used RF of different sizes, with 
the number of Decision Trees varying from 2 to 15 with a 
step of 1 (resulting in 14 different values). 

As we will discuss in the following section, the performance
of the individual classifiers, while better than random,
was judged to be not satisfactory. This is why we opted
for an ensemble method, which builds a series of classifiers
and then selects and combines them to provide the final
classification. Ensembles can enhance the power of low
performing individual classifiers, especially if these are
diverse: if they give false answers on different data
points (independent errors), then combining their knowledge
can improve accuracy. To create diverse classifiers, one can
vary the model parameters but also train them with different
data (known as the bagging method). Bagging matches
very well with subsampling to overcome the rare events
problem, and in fact it has been shown to be effective for
the class-imbalance problem. Hence, we adopt a similar
approach to build our individual classifiers. Every time a
new classifier is trained, a new training dataset is built by
considering all the data points in the positive class and a
random subset of the negative class. As described earlier,
the size of this subset is defined by the fsafe parameter.
By varying the value of fsafe and number of trees in the
RF algorithm, we created diverse classifiers. The following
algorithm details the procedure used to build the individual
classifiers in the ensemble.

Require: train_pos, train_neg, Shuffle(), Train()
fsafe ← {0.25, 0.5, 1, 2, 3, 4}
tree_count ← {2..15}
classifiers ← {}
start ← 0
for all fs ∈ fsafe do
    for all tc ∈ tree_count do
        end ← start + fs * |train_pos|
        if end > |train_neg| then
            start ← 0
            end ← start + fs * |train_pos|
            Shuffle(train_neg)
        end if
        train_data ← train_pos + train_neg[start : end]
        classifier ← Train(train_data, tc)
        append(classifiers, classifier)
        start ← end
    end for
end for
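The procedure above can be transcribed into Python roughly as follows. This is a sketch, not the released scripts: `train` stands for any callable playing the role of Train() (for example a wrapper around a Random Forest library), the data are plain lists, and truncating fs * |train_pos| to an integer is an assumption.

```python
import random


def build_classifiers(train_pos, train_neg, train,
                      fsafe_values=(0.25, 0.5, 1, 2, 3, 4),
                      tree_counts=range(2, 16), seed=0):
    """Train one classifier per (fsafe, tree count) combination.

    train(train_data, tree_count) is a caller-supplied stand-in for Train().
    """
    rng = random.Random(seed)
    train_pos = list(train_pos)
    train_neg = list(train_neg)
    classifiers = []
    start = 0
    for fs in fsafe_values:
        for tc in tree_counts:
            size = int(fs * len(train_pos))  # assumption: truncate to int
            end = start + size
            if end > len(train_neg):
                # Negatives exhausted: reshuffle and restart from the top.
                start = 0
                end = start + size
                rng.shuffle(train_neg)
            train_data = train_pos + train_neg[start:end]
            classifiers.append(train(train_data, tc))
            start = end
    return classifiers


# Dummy stand-in for Train(): records the dataset size and the tree count.
dummy_train = lambda data, tc: (len(data), tc)
pos = list(range(100))
neg = list(range(1000, 2000))
classifiers = build_classifiers(pos, neg, dummy_train)
print(len(classifiers), classifiers[0])
```

Each call consumes a fresh slice of the negative class, so consecutive classifiers see different SAFE points until the pool is exhausted and reshuffled.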

We repeated this procedure 5 times, resulting in 5 classifiers
for each combination of the parameters fsafe and RF
size. This resulted in a total of 420 RF in the ensemble (5
repetitions x 6 fsafe values x 14 RF sizes).

Once the pool of classifiers is obtained, a combining
strategy has to be used. Most existing approaches use the
majority vote rule: each classifier votes on the class
and the majority class becomes the final decision.
Alternatively, a weighted vote can be used, and we opted
for precision-weighted voting. For most existing methods,
weights correspond to the accuracy of each classifier on
training data. In our case, performance on training
data is close to perfect and accuracy is generally high,
which is why we use precision on a subset of the test 
data. Specifically, we divide the test data into two halves: 
an individual test data set and an ensemble test data set. 
The former is used to evaluate the precision of individual 
classifiers and obtain a weight for their vote. The latter 
provides the final evaluation of the ensemble. All data 
corresponding to the test day was used, with no subsampling. 
Table II shows the number of data points used for each
benchmark for training and testing. While the parameter
fsafe controlled the ratio SAFE/FAIL during training, FAIL
instances were much less frequent during testing, varying
between 13% and 36% of the number of SAFE instances.

            Train    Individual Test     Ensemble Test
Benchmark   FAIL     FAIL     SAFE       FAIL     SAFE
1           41485    2055     9609       2055     9610
2           41005    2010     9408       2011     9408
3           41592    1638     9606       1638     9606
4           42347    1770     9597       1770     9598
5           42958    1909     9589       1909     9589
6           42862    1999     9913       2000     9914
7           41984    1787     9821       1787     9822
8           39953    1520     10424      1520     10424
9           37719    1665     10007      1666     10008
10          36818    1582     9462       1583     9463
11          35431    1999     9302       1999     9302
12          35978    3786     10409      3787     10410
13          35862    2114     9575       2114     9575
14          39426    1449     9609       1450     9610
15          40377    1284     9783       1285     9784

Table II: Size of training and testing datasets. For training
data, the number of SAFE data points is the number of FAIL
multiplied by the fsafe parameter at each run.

To perform precision-weighted voting, we first applied
each RF $i$ obtained above to the individual test data and
computed their precision $p_i$ as the fraction of points labeled
FAIL that were actually failures. In other words, precision
is the probability that an instance labeled as a failure is
actually a real failure, which is why we decided to use this
as a weight. Then we applied each RF to the ensemble test
data. For each data point $j$ in this set, each RF provided
a classification $o_i^j$ (either 0 or 1 corresponding to SAFE or
FAIL, respectively). The classification of the ensemble (the
whole set of RF) was then computed as a continuous score

$$ s_j = \sum_i p_i \, o_i^j \qquad (1) $$

by summing individual answers weighted by their precision.
Finally, these were normalized by the highest score in the
data

$$ s'_j = \frac{s_j}{\max_j (s_j)} \qquad (2) $$

The resulting score $s'_j$ is proportional to the likelihood that
a data point is in the FAIL class: the higher the score,
the more certain we are that we have an actual failure. The
following algorithm outlines the procedure of obtaining the
final ensemble classification scores.

Require: classifiers, individual_test, ensemble_test
Require: Precision(), Classify()
classification_scores ← {}
weights ← {}
for all c ∈ classifiers do
    w ← Precision(c, individual_test)
    weights[c] ← w
end for
for all d ∈ ensemble_test do
    score ← 0
    for all c ∈ classifiers do
        score ← score + weights[c] * Classify(c, d)
    end for
    append(classification_scores, score)
end for
max ← Max(classification_scores)
for all s ∈ classification_scores do
    s ← s / max
end for

It assumes that the set of classifiers is available (classifiers),
together with the two test data sets (individual_test
and ensemble_test) and procedures to compute the precision of
a classifier on a dataset (Precision()) and to apply a classifier
to a data point (Classify(), which returns 0 for SAFE and 1
for FAIL).
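A compact Python rendering of the voting procedure is given below as a sketch. Classifiers are modeled as plain callables returning 0 (SAFE) or 1 (FAIL), which is a simplification; assigning weight 0 to a classifier that never predicts FAIL, and score 0 everywhere when the maximum is 0, are assumptions not covered by the text.

```python
def precision(predictions, labels):
    """Fraction of points predicted FAIL (1) whose true label is FAIL (1)."""
    flagged = [y for p, y in zip(predictions, labels) if p == 1]
    # Assumption: a classifier that never predicts FAIL gets weight 0.
    return sum(flagged) / len(flagged) if flagged else 0.0


def ensemble_scores(classifiers, individual_test, individual_labels,
                    ensemble_test):
    """Precision-weighted vote, normalized by the highest score."""
    weights = [
        precision([clf(x) for x in individual_test], individual_labels)
        for clf in classifiers
    ]
    raw = [
        sum(w * clf(x) for w, clf in zip(weights, classifiers))
        for x in ensemble_test
    ]
    top = max(raw, default=0.0)
    # Assumption: if no classifier ever votes FAIL, all scores are 0.
    return [s / top if top > 0 else 0.0 for s in raw]


# Two toy classifiers on one-dimensional points.
clf_a = lambda x: 1 if x > 0 else 0   # flags positive inputs
clf_b = lambda x: 1                   # always flags
scores = ensemble_scores(
    [clf_a, clf_b],
    individual_test=[-1, 1, 2, -2],
    individual_labels=[0, 1, 1, 0],
    ensemble_test=[3, -3],
)
print(scores)
```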

IV. Classification results 

The ensemble classifier was applied to all 15 benchmark 
datasets. Training was done on an iMac with 3.06GHz Intel 
Core 2 Duo processor and 8GB of 1067MHz DDR3 memory 
running OSX 10.9.3. Training of the entire ensemble took 
between 7 and 9 hours for each benchmark dataset. 

Given that the result of the classification is a continuous
score (Equation 2), and not a discrete label, evaluation was
based on the Receiver Operating Characteristic (ROC) and
Precision-Recall (PR) curves. A class can be obtained for a
data point j from the score s' by using a threshold s*. A
data point is considered to be in the FAIL class if s' > s*.
The smaller s*, the more instances are classified as failures.
Thus, by decreasing s* the number of true positives grows
but so do the false positives. Similarly, at different threshold
values, a certain precision is obtained. The ROC curve plots
the True Positive Rate (TPR) versus the False Positive Rate
(FPR) of the classifier as the threshold is varied. Similarly,
the PR curve displays the precision versus recall (equal to
TPR or Sensitivity). It is common to evaluate a classifier 
by computing the area under ROC (AUROC) and area 
under PR (AUPR) curves, which can range from 0 to 1. 



Figure 3: AUROC and AUPR on ensemble test data for all 
benchmarks. 






























(a) Worst case (Benchmark 4) 



(b) Best case (Benchmark 14) 


Figure 4: ROC and PR curves for worst and best performance across the 15 benchmarks (4 and 14, respectively). The
vertical lines correspond to FPR of 1%, 5% and 10%. Note that parameter fsafe controls the ratio of SAFE to FAIL data in
the training datasets.


AUROC values greater than 0.5 correspond to classifiers that 
perform better than random guesses, while AUPR represents 
an average classification precision, so, again, the higher the 
better. AUROC and AUPR do not depend on the relative 
distribution of the two classes, so they are particularly 
suitable for class-imbalance problems such as the one at 
hand. 
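To make the threshold sweep concrete, the sketch below computes ROC points and the area under them with the trapezoidal rule. It is an illustrative implementation under assumptions: labels are 1 for FAIL and 0 for SAFE, both classes are present, and a point is flagged when its score is at least an observed score value, which is equivalent to a strict threshold placed just below that value.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos  # assumption: both classes are present
    points = [(0.0, 0.0)]        # threshold above every score
    for th in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Perfectly separated toy scores: both failures rank above both SAFE points.
points = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
auroc = area_under(points)
print(points, auroc)
```

Reading the TPR at a fixed FPR (for example 5%) amounts to picking the point on this curve with the largest TPR whose FPR does not exceed that limit.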

Figure 3 shows AUROC and AUPR values obtained for
all datasets, evaluated on the ensemble test data. For all 
benchmarks, AUROC values are very good, over 0.75 and up 
to 0.97. AUPR ranges between 0.38 and 0.87. Performance 
appears to increase, especially in terms of precision, towards 
the end of the trace. Lower performance that is observed 
for the first two benchmarks could be due to the fact that 


some of the aggregated features (those over 3 or 4 days) are 
computed with incomplete data at the beginning. 

To evaluate the effect of the different parameters and 
the ensemble approach, Figure 4 displays the ROC and PR 
curves for the benchmarks that give the worst and best 
results (4 and 14, respectively). The performance of the 
individual classifiers in the ensemble is also displayed (as 
points in the ROC/PR space, since their answer is categorical). 
We can see that individual classifiers result in very low FPR, 
which is very important in predicting failures. Yet, in many 
cases, the TPR values are also very low. This means that 
most test data is classified as SAFE and very few failures 
are actually identified. 

Figure 5: Representation of the time axis for one machine. 
The SAFE and FAIL labels are assigned to time points 
based on the time to the next failure. Misclassification has 
different impacts depending on its position on the time axis. 
In particular, if we misclassify a data point close to the 
transition from SAFE to FAIL, the impact is lower than if 
we misclassify far from the boundary. The latter situation 
would mean flagging a failure even if no failure will appear 
for a long time, or marking a machine as SAFE when failure 
is imminent. 

TPR appears to increase when the fsafe parameter 
decreases, but at the expense of the FPR and Precision. The 
plots show quantitatively the clear dependence between the 
three plotted measures and fsafe values. As the amount 
of SAFE training data decreases, the classifiers become 
less stringent and can identify more failures, which is an 
important result for this class-imbalance problem. Also, the 
plots show clearly that individual classifiers obtained with 
different values for fsafe are diverse, which is critical for 
obtaining good ensemble performance. 
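
As a rough illustration of how fsafe shapes a training set, the sketch below keeps every FAIL point and draws a random subset of SAFE points so that the SAFE-to-FAIL ratio equals fsafe. The function name, the rounding rule, the seed handling and the data sizes are assumptions made for this example; they are not taken from the scripts used in the study.

```python
import random


def subsample_training_set(safe_points, fail_points, fsafe, seed=0):
    """Keep all FAIL points and draw about fsafe * len(fail_points) SAFE points.

    Sampling is without replacement and capped at the number of SAFE points.
    """
    rng = random.Random(seed)
    n_safe = min(len(safe_points), round(fsafe * len(fail_points)))
    return rng.sample(safe_points, n_safe), list(fail_points)


# Hypothetical data: 1000 SAFE and 10 FAIL points, identified by index only.
safe = list(range(1000))
fail = list(range(1000, 1010))

# Smaller fsafe values leave fewer SAFE examples in the training set.
safe_sample_4, fail_sample_4 = subsample_training_set(safe, fail, fsafe=4)
safe_sample_1, fail_sample_1 = subsample_training_set(safe, fail, fsafe=1)
```

Training one classifier per (fsafe, seed) pair gives each ensemble member a different view of the SAFE class, which is one plausible source of the diversity noted above.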

In general, the points corresponding to the individual 
classifiers are below the ROC and PR curves describing the 
performance of the ensemble. This indicates that the ensemble 
method is better than the individual classifiers for this 
problem, which may be due to their diversity. Some 
exceptions do appear (points above the solid lines), but only 
for very low TPR (under 0.2), in an area of the ROC/PR 
space that is not interesting from our point of view. We are 
interested in maximizing the TPR while keeping the FPR at 
bay. Specifically, the FPR should never grow beyond 5%, 
which means few false alarms. At this threshold, the two 
examples from Figure 4 display TPR values of 0.272 (worst 
case) and 0.886 (best case), corresponding to precision 
values of 0.502 and 0.728, respectively. This is much better 
than individual classifiers at this level, both in terms of 
precision and TPR. For failure prediction, this means that 
between 27.2% and 88.6% of failures are identified as such, 
while from all instances labeled as failures, between 50.2% 
and 72.8% are actual failures. 
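
Choosing an operating point under an FPR cap can be sketched as follows. The function name and the data are hypothetical; the sketch simply takes the lowest score threshold whose FPR does not exceed the cap and reports the TPR and precision reached there.

```python
def operating_point(scores, labels, max_fpr=0.05):
    """Return (threshold, TPR, FPR, precision) at the lowest threshold
    whose FPR stays within max_fpr. A point is FAIL if score > threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    # FPR can only grow as the threshold decreases, so stop at the first violation.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
        fpr = fp / n_neg
        if fpr > max_fpr:
            break
        precision = tp / (tp + fp) if tp + fp else None
        best = (threshold, tp / n_pos, fpr, precision)
    return best


# Hypothetical test set: 5 FAIL points and 20 SAFE points.
pos_scores = [0.9, 0.8, 0.7, 0.5, 0.3]
neg_scores = [0.85, 0.55] + [0.1] * 18
scores = pos_scores + neg_scores
labels = [1] * len(pos_scores) + [0] * len(neg_scores)

threshold, tpr, fpr, precision = operating_point(scores, labels, max_fpr=0.05)
```

With 20 SAFE points, a 5% cap allows a single false alarm, so the sweep stops just before the second SAFE point would be flagged.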

In order to analyze the implications of the obtained 
results in more detail, the relation between the classifier 
label and the exact time until the next remove event was 
studied for the data points. This is important because we 
originally assigned the label SAFE to all data points that are 
more than 24 hours away from a failure. According to this 
classification, a machine would be considered to be in SAFE 
state whether it fails in 2 weeks or in 2 days. Similarly, it is 
considered to be in FAIL state whether it fails in 23 hours or 
in 10 minutes. Obviously these are very different situations, 
and the impact of misclassification varies depending on the 
time to the next failure. Figure 5 displays this graphically. 
As the time to the next failure decreases, a SAFE data point 
misclassified as FAIL counts less as a misclassification, since 
failure is actually approaching. Similarly, a FAIL data point 
labeled as SAFE has a higher negative impact when it is 
close to the point of failure. 

We would like to verify whether our classifier assigns 
correct and incorrect labels uniformly within each class, 
irrespective of the real time to the next failure. For this, 
Figure 6 shows the distribution of the time to the next 
failure, in the form of boxplots, for true positives (TP), 
false negatives (FN), true negatives (TN) and false positives 
(FP), again for the worst and best cases (benchmarks 4 and 
14, respectively). We look at results obtained at 5% FPR 
values. A good result would be if misclassified positives 
are further in time from the point of failure than 
correctly classified failures, and misclassified negatives are 
closer to the failure point than correctly classified 
negatives. 

Figure 6: Distribution of time to the next machine REMOVE 
for data points classified, divided into True Positive (TP), 
False Negative (FN), True Negative (TN) and False Positive 
(FP). Panels: (a) Benchmark 4, positive class; (b) Benchmark 
4, negative class; (c) Benchmark 14, positive class; 
(d) Benchmark 14, negative class. The worst and best 
performance (benchmarks 4 and 14, respectively) out of the 
15 runs are shown. Actual numbers of instances in each class 
are shown in parentheses on the horizontal axis. 

All positive instances, which are data points corresponding 
to real failures, are shown in the left panels (Figures 6a 
and 6c). These are divided into TP (failures correctly 
identified by the classifier) and FN (failures missed by the 
classifier). All data points have a time to the next remove 
event between 0 and 24 hours, due to the way the positive 
class was defined. If the classifier were independent of the 
time to the next failure, the two distributions shown for TP 
and FN would be very similar. However, the plots show 
that TP have, on average, lower times until the next event 
than FN. This means that many of the positive data 
points that are misclassified are further in time from the 
actual failure moment than those correctly identified. 
This is good news, because it suggests that although data 
points are not recognized as imminent failure situations 
when there is still some time left before the actual failure, 
correct classification may occur as the failure moment 
approaches. In fact, if we compute the fraction of failure 
events that are flagged at least once in the preceding 24-hour 
window, we obtain values larger than the TPR computed 
at the data-point level, especially when prediction power is 
lowest (52.5% vs. 27.2% for benchmark 4 and 88.7% vs. 
88.6% for benchmark 14). 
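
The difference between the data-point TPR and the fraction of failure events flagged at least once can be made concrete with a small sketch. The record layout (an event identifier paired with the classifier's decision for each positive data point) and the toy values are assumptions made for illustration; they do not reproduce the figures reported above.

```python
from collections import defaultdict


def point_level_tpr(records):
    """Fraction of positive data points that the classifier labeled FAIL."""
    return sum(1 for _, predicted_fail in records if predicted_fail) / len(records)


def event_level_recall(records):
    """Fraction of failure events with at least one data point labeled FAIL."""
    flagged = defaultdict(bool)
    for event_id, predicted_fail in records:
        flagged[event_id] = flagged[event_id] or predicted_fail
    return sum(flagged.values()) / len(flagged)


# Hypothetical positive data points: (failure event id, predicted FAIL?).
# Each event contributes several points from its preceding 24-hour window.
records = [
    ("A", False), ("A", False), ("A", True),
    ("B", False), ("B", False), ("B", False),
    ("C", True), ("C", True), ("C", False), ("C", False),
]

tpr_points = point_level_tpr(records)
recall_events = event_level_recall(records)
```

Because one flagged point is enough to count an event as detected, the event-level figure can never be lower than what a single well-timed alarm per event would give, and it exceeds the point-level TPR whenever missed points belong to events that were flagged at some other time.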

Negative (safe) instances, i.e., data points that do not 
precede a remove event by less than 24 hours, can be 
divided into two classes: those for which there is a remove 
due to a real failure before the end of the trace and those 
for which there is none. For the latter, we have no means to 
estimate the time to the next failure event. Figures 6b and 
6d therefore show the time to the next REMOVE only for the 
former, i.e., those machines which will fail before the end 
of the trace. This is only a small fraction of the entire SAFE 
test data, especially for the benchmark with best performance, 
because it is only three days before the trace ends. However, 
it still provides some indication of the time to the next 
failure of the negative class, divided into FP (negatives that 
are labeled as failures) and TN (negatives correctly labeled 
safe). As the figure shows, on average, times to the next 
failure are lower for FP than for TN. This is again a 
good result because it means that the classifier often 
gives false alarms when a failure is approaching, even if it 
is not strictly in the next 24 hours. 
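
The comparison behind Figure 6 can be sketched by grouping data points into TP, FN, FP and TN and summarizing the time to the next REMOVE in each group. The sketch below uses medians instead of boxplots, and all values are hypothetical; only the 24-hour horizon is taken from the text.

```python
from statistics import median

HORIZON_HOURS = 24  # a data point is a true FAIL if a REMOVE follows within 24 hours


def category(hours_to_remove, predicted_fail):
    """Name the confusion-matrix cell of one data point."""
    true_fail = hours_to_remove < HORIZON_HOURS
    if true_fail:
        return "TP" if predicted_fail else "FN"
    return "FP" if predicted_fail else "TN"


def median_time_by_category(points):
    """points: iterable of (hours to next REMOVE, predicted FAIL?)."""
    groups = {}
    for hours, predicted_fail in points:
        groups.setdefault(category(hours, predicted_fail), []).append(hours)
    return {name: median(values) for name, values in groups.items()}


# Hypothetical data points for machines that do fail before the trace ends.
points = [
    (2, True), (5, True),                     # TP: flagged close to the failure
    (20, False), (22, False),                 # FN: missed, far from the failure
    (30, True), (40, True),                   # FP: flagged slightly too early
    (100, False), (200, False), (300, False), # TN
]

medians = median_time_by_category(points)
```

In this toy data the pattern matches the one described in the text: TP sit closer to the failure than FN, and FP sit closer to the failure than TN.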

V. Related work 

The publication of the Google trace data has triggered 
a flurry of activity within the community, including several 
studies with goals that are related to ours. Some of these 
provide general characterization and statistics about the 
workload and node state for the cluster [?], [?], [?] and 
identify high levels of heterogeneity and dynamism in the 
system, especially when compared to grid workloads [?]. User 
profiles [?] and task usage shapes [?] have also been 
characterized for this cluster. Other studies have applied 
clustering techniques for workload characterization, either 
in terms of jobs and resources [?], [?] or placement 
constraints [?], with the aim to synthesize new traces. A 
different type of usage is for validation of various workload 
management algorithms. Examples are [?], where the trace 
is used to evaluate consolidation strategies, [22], [?], where 
over-committing (overbooking) is validated, [?], which takes 
heterogeneity into account to perform provisioning, and [?], 
which investigates checkpointing algorithms. 

System modeling and prediction studies using the Google 
trace data are far fewer than those performing 
characterization or validation. An early attempt at system 
modeling based on this trace [?] validates an event-based 
simulator using workload parameters extracted from the data, 
with good performance in simulating job status and overall 
system load. Host load prediction using a Bayesian classifier 
was analyzed in [27]. Using CPU and RAM history, the 
mean load in a future time window is predicted, by dividing 
possible load levels into 50 discrete states. Here we perform 
prediction of machine failures for this cluster, which has not 
been attempted to date, to our knowledge. 

Failure prediction has been an active topic for many years, 
with a comprehensive review presented in [28]. This work 
summarizes several methods of failure prediction in single 
machines, clusters, application servers, file systems, hard 
drives, email servers and clients, etc., dividing them into 
failure tracking, symptom monitoring or error reporting. The 
method introduced here falls into the symptom monitoring 
category; however, elements of failure tracking and error 
reporting are also present through features like number of 
recent failures and job failure events. 

More recent studies concentrate on larger scale distributed 
systems such as HPC or clouds. For failure tracking methods, 
an important resource is the failure trace archive [29], 
a repository for failure logs and an associated toolkit that 
can enable integrated characterization of failures, such as 
distributions of inter-event times. Job failure in a cloud 
setting has been analyzed in [30]. The naive Bayes classifier 
is used to obtain a probability of failure based on the job 
type and host name. This is applied on traces from Amazon 
EC2 running several scientific applications. The method 
performs differently on jobs from three different 
application settings, with FNs of 4%, 12% and 16% of total 
data points and corresponding FPs of 0%, 4% and 10% of 
total data points. This corresponds approximately to FPR of 
0%, 5.7% and 16.3%, and TPR of 86.6%, 61.2% and 58.9%. 
The performance we obtained with our method is within a 
similar range for most benchmarks, although we never reach 
their best performance. However, we are predicting machine 
and not job failures. 




A comparison of different classification tools for failure 
prediction in an IBM Blue Gene/L cluster is given in [31]. 
In this work, they analyze Reliability, Availability and 
Serviceability (RAS) events using SVMs, neural networks, 
rule-based classifiers and a custom nearest neighbor algorithm, 
trying to predict whether different event categories will 
appear. The custom nearest neighbor algorithm outperforms 
the others, reaching 50% precision and 80% TPR. A similar 
analysis was also performed for a Blue Gene/Q cluster [32]. 
The best performance was again by the nearest neighbor 
classifier (10% FPR, 20% TPR). They never evaluated the 
Random Forest or ensemble algorithms. 

In [33] an anomaly detection algorithm for cloud 
computing was introduced. It employs Principal Component 
Analysis and selects the most relevant principal components 
for each failure type. They find that higher order components 
are better correlated with errors. Using thresholds on these 
principal components, they identify data points outside the 
normal range. They study four types of failures: CPU- 
related, memory-related, disk-related and network-related 
faults, in a controlled in-house system with fault injection 
and obtain very high performance, 91.4% TPR at 3.7% FPR. 
On a production trace (the same Google trace we are using) 
they predict task failures at 81.5% TPR and 27% FPR (at 5% 
FPR, TPR is down to about 40%). In our case, we studied 
the same trace but looking at machine failures as opposed 
to task failures, and obtained TPR values between 27% and 
88% at 5% FPR. 

All above-mentioned failure prediction studies concentrate 
on types of failures or systems different from ours and obtain 
variable results. In all cases, our predictions compare well 
with prior studies, with our best result being better than 
most. 

VI. Discussion 

Here we consider the possibility of including our predictive 
model in a data-driven autonomic controller for use in 
data centers. In such a scenario, both model building and 
model updating would happen on-line on data that is being 
streamed from various sources. From a technical point of 
view, on-line use would require a few changes to our model 
workflow. 

To build the model, all features have to be computed 
on-line. Log data can be collected in a BigQuery table using 
the streaming API. As data is being streamed, features have 
to be computed at 5-minute intervals. Both basic and 
aggregated features (averages, standard deviations, 
coefficients of variation and correlations) have to be 
computed, but only for the last time window (previous time 
windows are already stored in a dedicated table). Basic 
features are straightforward to compute and require 
negligible running time, since they can be computed using 
accumulators as the events come in. For aggregated features, 
parallelization can be employed, since they are all 
independent. In our experiments, correlation computation was 
the most time consuming, with an average time to compute one 
correlation feature over the longest time window of 1489.2 
seconds for all values over the 29 days (Table [?]). For 
computing a single value (for the newly streamed data), the 
time required should be on average under 0.2 seconds. This 
estimate corresponds to a linear dependence between the 
number of time windows and computation time and offers an 
upper bound for the time required. If this stage is performed 
in parallel on BigQuery, this would also be the average time 
to compute all 126 correlation features (each feature can be 
computed independently so speedup would be linear). 
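
The per-value estimate can be checked with a short calculation. It assumes, as the text suggests, that the 1489.2 seconds cover one value per 5-minute window over the full 29 days and that the cost is spread evenly across windows; the variable names are ours.

```python
TRACE_DAYS = 29
WINDOW_MINUTES = 5
TOTAL_SECONDS_ONE_FEATURE = 1489.2  # one correlation feature, all windows

# Number of 5-minute windows in the 29-day trace.
n_windows = TRACE_DAYS * 24 * 60 // WINDOW_MINUTES

# Average time to compute the feature for a single (newly streamed) window,
# assuming the total time scales linearly with the number of windows.
seconds_per_window = TOTAL_SECONDS_ONE_FEATURE / n_windows
```

This gives 8352 windows and roughly 0.18 seconds per window, which is consistent with the bound of 0.2 seconds quoted above.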

In terms of dollar costs, we expect figures similar to those 
during our tests — about 60 USD per day for storage and 
analysis. To this, the streaming costs would have to be added 
— currently 1 cent per 200MB. For our system, the original 
raw data is about 200GB for all 29 days, so this would 
translate to approximately 7GB of data streamed every day 
for a system of similar size, leading to a cost of about 35 
cents per day. At all times, only the last 12 days of features 
need to be stored, which keeps data size relatively low. In 
our analysis, for all 29 days, the final feature table requires 
295GB of BigQuery storage, so 12 days would amount to 
about 122GB of data. 
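
The streaming and storage figures follow from simple proportions, sketched below. The conversion of 1 GB to 1000 MB is an assumption chosen because it reproduces the 35 cents quoted in the text; the prices are those stated above, not current ones.

```python
RAW_TRACE_GB = 200       # raw data for the whole trace
TRACE_DAYS = 29
MB_PER_GB = 1000         # assumption: decimal gigabytes
CENTS_PER_200MB = 1      # streaming price quoted in the text
FEATURE_TABLE_GB = 295   # final feature table for all 29 days
RETAINED_DAYS = 12       # only the last 12 days of features are kept

raw_gb_per_day = RAW_TRACE_GB / TRACE_DAYS     # about 6.9 GB
streamed_gb_per_day = round(raw_gb_per_day)    # rounded to 7 GB, as in the text
streaming_cents_per_day = streamed_gb_per_day * MB_PER_GB / 200 * CENTS_PER_200MB

# Storage needed when only the last 12 days of features are retained.
retained_feature_gb = FEATURE_TABLE_GB * RETAINED_DAYS / TRACE_DAYS
```

Streaming therefore adds well under one dollar per day to the roughly 60 USD per day estimated for storage and analysis.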

When a new model has to be trained (e.g., once a day), 
all necessary features are already computed. One can use 
an infrastructure like the Google Compute Engine to train 
the model, which would eliminate the need to download the 
data and would allow for training of the individual classifiers 
of the ensemble in parallel. In our tests, the entire ensemble 
took under 9 hours to train, with each RF requiring at most 3 
minutes. Again, since each classifier is independent, training 
all classifiers in parallel would take under 3 minutes as well 
(provided one can use as many CPUs as there are RFs — 420 
in our study). Combining the classifiers takes a negligible 
amount of time. 
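
Since the classifiers do not depend on one another, training maps naturally onto a pool of workers. The sketch below shows only that structure: the trainer is a toy one-feature threshold model standing in for a Random Forest, the data are synthetic, and the (fsafe, seed) job list is hypothetical. Threads are used to keep the example self-contained; CPU-bound training would need separate processes or machines to obtain the speedup discussed above.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Synthetic one-dimensional data: SAFE values centered at 0, FAIL values at 2.
_rng = random.Random(42)
SAFE = [_rng.gauss(0.0, 1.0) for _ in range(2000)]
FAIL = [_rng.gauss(2.0, 1.0) for _ in range(50)]


def train_one(job):
    """Train one stand-in classifier from a (fsafe, seed) job description."""
    fsafe, seed = job
    rng = random.Random(seed)
    n_safe = min(len(SAFE), round(fsafe * len(FAIL)))
    safe_sample = rng.sample(SAFE, n_safe)
    # Toy model: threshold halfway between the two class means.
    threshold = (mean(safe_sample) + mean(FAIL)) / 2
    return {"fsafe": fsafe, "seed": seed, "threshold": threshold}


# One job per ensemble member; no job needs the result of another.
jobs = [(fsafe, seed) for fsafe in (1, 2, 4) for seed in range(5)]

with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    models = list(pool.map(train_one, jobs))
```

With one worker per job, the wall-clock time is bounded by the slowest single training run, which is the reasoning behind the estimate of under 3 minutes given above.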

All in all, we expect the entire process of updating the 
model to take under 5 minutes if full parallelization is used 
both for feature computation and training. Application of the 
model on new data requires a negligible amount of time once 
features are available. This makes the method very practical 
for on-line use. Here we have described a cloud computing 
scenario; however, given the relatively limited computation 
and storage resources that are required, we believe that more 
modest clusters can also be used for monitoring, model 
updating and prediction. 

VII. Conclusions 

We have presented a predictive study for failure of nodes 
in a Google cluster based on a published workload trace. 
Feature extraction from raw data was performed using 
BigQuery, the big data cloud platform from Google which 
enables SQL-like queries. A large number of features were 
generated and an ensemble classifier was trained on log data 
for 10 days and tested on the following non-overlapping day. 
The length of the trace allowed repeating this process 15 
times producing 15 benchmark datasets, with the last day in 
each dataset being used for testing. 

The BigQuery platform was extremely useful for obtaining 
the features from log data. Although limits were found 
for JOIN and GROUP BY statements, these were circumvented 
by creating intermediate tables, which sometimes contained 
over 12TB of data. Even so, features were obtained with 
reduced running times, with the overall cost for the entire 
analysis, processing one month's worth of logs, coming in at 
under 2000 USD^1, resulting in a daily cost of just over 60 
USD. 

Classification performance varied from one benchmark to 
another, with the Area-Under-the-ROC curve measure varying 
between 0.76 and 0.97 and the Area-Under-the-Precision-Recall 
curve measure varying between 0.38 and 0.87. This 
corresponded to true positive rates in the range 27%-88% 
and precision between 50% and 72% at a false positive rate 
of 5%. In other words, in the worst case, we 
were able to identify 27% of failures, while if a data point 
was classified as a failure, we could have 50% confidence 
that we were looking at a real failure. For the best case, we 
were able to identify almost 90% of failures and 72% of 
instances classified as failures corresponded to real failures. 
All this comes at the cost of having false alarms 5% of the 
time. 

Although not perfect, our predictions achieve good 
performance levels. Results could be improved by changing the 
subsampling procedure. Here, only a subset of the SAFE 
data was used due to the large number of data points 
in this class, and a random sample was extracted from 
this subset when training each classifier in the ensemble. 
One could instead subsample every time from the full set, 
but this would require greater computational resources 
for training, since a single workstation cannot process 300 
GB of data at a time. Training times could be reduced 
through parallelization, since the problem is embarrassingly 
parallel (each classifier in the ensemble can be trained 
independently from the others). These improvements will 
be pursued in the future. Introduction of additional features 
will also be explored, to take into account in a more explicit 
manner the interaction between machines. BigQuery will 
be used to obtain interactions between machines from the 
data, which will result in networks of nodes. Changes in 
the properties of these networks over time could provide 
important information on possible future failures. 

The method presented here is very suitable for on-line 
use. A new model can be trained every day, using the last 
12 days of logs. This is the scenario we simulated when we 
created the 15 test benchmarks. The model would be trained 
from 10 days of data and tested on one non-overlapping day, 
exactly like in the benchmarks (Figure [?]). Then, it would 
be applied for one day to predict future failures. The next 
day a new model would be obtained from new data. Each 
time, only the last 12 days of data would be used, rather 
than increasing the amount of training data. This is to 
account for the fact that the system itself and the workload 
can change in time, so old data may not match current system 
behavior. This would ensure that the model is up to date 
with the current system state. The testing stage is required 
for live use for two reasons. First, part of the test data is 
used to build the ensemble (prediction-weighted voting). 
Secondly, the TPR and precision values on test data can 
help system administrators make decisions on the criticality 
of the predicted failure. 

^1 Based on current Google BigQuery pricing. 